This quickstart guide explains how to join two tables A and B using the Jaccard similarity measure. First, you need to import the required packages as follows (installing py_stringsimjoin automatically installs its dependencies, py_stringmatching and pandas):


In [3]:
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os, sys

In [4]:
print('python version: ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)


python version: 2.7.12 |Anaconda 2.5.0 (x86_64)| (default, Jul  2 2016, 17:43:17) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]
py_stringsimjoin version: 0.1.0
py_stringmatching version: 0.2.0
pandas version: 0.17.1

Joining two tables using the Jaccard measure typically consists of four steps:

  1. Loading the input tables
  2. Profiling the tables
  3. Creating a tokenizer
  4. Performing the join

1. Loading the input tables

We begin by loading the two tables. For the purpose of this guide, we use the sample dataset that comes with the package.


In [5]:
# Construct the paths of the tables to be loaded. Since we are loading a 
# sample dataset from the package, we need to access the data from the path 
# where the package is installed. If you want to load your own data, you can
# directly provide your table's path to the read_csv command.

table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_A.csv'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_B.csv'])

In [6]:
# Load csv files as dataframes.
A = pd.read_csv(table_A_path)
B = pd.read_csv(table_B_path)
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))


Number of records in A: 7
Number of records in B: 8

In [7]:
A


Out[7]:
A.id A.name A.birth_year A.hourly_wage A.address A.zipcode
0 a1 Kevin Smith 1989 30.0 607 From St, San Francisco 94107
1 a2 Michael Franklin 1988 27.5 1652 Stockton St, San Francisco 94122
2 a3 William Bridge 1986 32.0 3131 Webster St, San Francisco 94107
3 a4 Binto George 1987 32.5 423 Powell St, San Francisco 94122
4 a5 Alphonse Kemper 1984 35.0 1702 Post Street, San Francisco 94122
5 a6 1990 40.0 24th Street, San Francisco 94122
6 a7 NaN 1986 25.0 20th Street, San Francisco NaN

In [8]:
B


Out[8]:
B.id B.name B.birth_year B.hourly_wage B.address B.zipcode
0 b1 Mark Levene 1987 29.5 108 Clement St, San Francisco 94107
1 b2 Bill Bridge 1986 32.0 3131 Webster St, San Francisco 94107
2 b3 Mike Franklin 1988 27.5 1652 Stockton St, San Francisco 94122
3 b4 Joseph Kuan 1982 26.0 108 South Park, San Francisco 94122
4 b5 Alfons Kemper 1984 35.0 170 Post St, Apt 4, San Francisco 94122
5 b6 Michael Brodie 1987 32.5 133 Clement Street, San Francisco 94107
6 b7 1990 40.0 24th Street, San Francisco 94122
7 b8 NaN 1986 25.0 20th Street, San Francisco NaN

2. Profiling the tables

Before performing the join, we may want to profile the tables to understand the characteristics of their attributes. This can help identify:

a) unique attributes in the table, which can be used as the key attribute when performing the join. A key attribute is needed to uniquely identify a tuple.

b) the number of missing values in each attribute. This can help you decide which attribute to join on; for example, an attribute with many missing values may not be a good join attribute. It also determines how you should handle missing values when performing the join (see the 'Handling missing values' section below for the available options).

You can profile the attributes in a table using the following command:


In [9]:
# profile attributes in table A
ssj.profile_table_for_join(A)


Out[9]:
Unique values Missing values Comments
Attribute
A.id 7 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
A.name 7 (100.0%) 1 (14.29%) Joining on this attribute will ignore 1 (14.29%) rows.
A.birth_year 6 (85.71%) 0 (0.0%)
A.hourly_wage 7 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
A.address 7 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
A.zipcode 3 (42.86%) 1 (14.29%) Joining on this attribute will ignore 1 (14.29%) rows.

In [10]:
# profile attributes in table B
ssj.profile_table_for_join(B)


Out[10]:
Unique values Missing values Comments
Attribute
B.id 8 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
B.name 8 (100.0%) 1 (12.5%) Joining on this attribute will ignore 1 (12.5%) rows.
B.birth_year 6 (75.0%) 0 (0.0%)
B.hourly_wage 8 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
B.address 8 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
B.zipcode 3 (37.5%) 1 (12.5%) Joining on this attribute will ignore 1 (12.5%) rows.

If an input table does not contain a key attribute, you need to create one. In the current example, both input tables A and B already have key attributes, so you can proceed to the next step. If a table lacks a key attribute, you can add one using the following command:


In [11]:
B['new_key_attr'] = range(0, len(B))
B


Out[11]:
B.id B.name B.birth_year B.hourly_wage B.address B.zipcode new_key_attr
0 b1 Mark Levene 1987 29.5 108 Clement St, San Francisco 94107 0
1 b2 Bill Bridge 1986 32.0 3131 Webster St, San Francisco 94107 1
2 b3 Mike Franklin 1988 27.5 1652 Stockton St, San Francisco 94122 2
3 b4 Joseph Kuan 1982 26.0 108 South Park, San Francisco 94122 3
4 b5 Alfons Kemper 1984 35.0 170 Post St, Apt 4, San Francisco 94122 4
5 b6 Michael Brodie 1987 32.5 133 Clement Street, San Francisco 94107 5
6 b7 1990 40.0 24th Street, San Francisco 94122 6
7 b8 NaN 1986 25.0 20th Street, San Francisco NaN 7

For the purpose of this guide, we will now join tables A and B on the 'name' attribute using the Jaccard measure. Next, we need to decide what threshold to use for the join. For this guide, we will use a threshold of 0.3. Specifically, the join will find tuple pairs from A and B such that the Jaccard score over the 'name' attributes is at least 0.3.
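
To get a feel for what a threshold of 0.3 means, you can compute the Jaccard score for a single pair of names directly with py_stringmatching's Jaccard measure class. This is a small illustration using hand-tokenized names and is not part of the join itself:


In [ ]:
# Illustration only: compute the Jaccard score between two tokenized names.
# 'Michael Franklin' and 'Mike Franklin' share 1 token out of 3 distinct
# tokens, so the score is 1/3 (about 0.33), which clears the 0.3 threshold.
jac = sm.Jaccard()
jac.get_sim_score(['Michael', 'Franklin'], ['Mike', 'Franklin'])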

3. Creating a tokenizer

Since the Jaccard measure treats input strings as sets of tokens, we need to select a tokenizer that can be used to tokenize each string into a set of tokens. Currently, we support tokenizers from the py_stringmatching package, which provides five different tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer.

For the purpose of this guide, we will use a whitespace tokenizer. Once we have selected a tokenizer type, we need to create a tokenizer object as shown below:


In [12]:
# Create a whitespace tokenizer for tokenizing the 'name' attribute. The return_set flag should be
# set to True since Jaccard is a set-based measure.
ws = sm.WhitespaceTokenizer(return_set=True)

# a whitespace tokenizer will tokenize the input string using whitespace
ws.tokenize('William Bridge')


Out[12]:
['William', 'Bridge']
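
If whitespace tokens are too coarse for your data, you could instead create, say, a qgram tokenizer, which breaks a string into overlapping character sequences. The sketch below (not used in the rest of this guide) assumes py_stringmatching's QgramTokenizer:


In [ ]:
# Alternative tokenizer (not used below): a 3-gram tokenizer, which splits a
# string into overlapping sequences of 3 characters. Setting return_set=True
# again makes it suitable for set-based measures such as Jaccard.
qg3 = sm.QgramTokenizer(qval=3, return_set=True)
qg3.tokenize('William Bridge')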

4. Performing the join

The next step after creating a tokenizer is to perform the join. The Jaccard join can be performed using the following command:


In [13]:
# find all pairs from A and B such that the Jaccard score
# on 'name' is at least 0.3.
# l_out_attrs and r_out_attrs denote the attributes from the 
# left table (A) and right table (B) that need to be included in the output.

output_pairs = ssj.jaccard_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', ws, 0.3, 
                                l_out_attrs=['A.name'], r_out_attrs=['B.name'])


0%   100%
[#####  ] | ETA: 00:00:00

In [14]:
len(output_pairs)


Out[14]:
5

In [15]:
# examine the output pairs
output_pairs


Out[15]:
_id l_A.id r_B.id l_A.name r_B.name _sim_score
0 0 a3 b2 William Bridge Bill Bridge 0.333333
1 1 a2 b3 Michael Franklin Mike Franklin 0.333333
2 2 a5 b5 Alphonse Kemper Alfons Kemper 0.333333
3 3 a2 b6 Michael Franklin Michael Brodie 0.333333
4 4 a6 b7 1.000000

Handling empty values

By default, pairs with empty sets of tokens are included in the output. This is because the Jaccard score of two empty sets is not well defined, and we do not want to miss any potential matches. As you can see from the previous output, the tuple pair (a6, b7) is included in the output with a similarity score of 1. If you do not want pairs containing empty sets of tokens in the output, you need to set the allow_empty flag to False as shown below:


In [16]:
output_pairs = ssj.jaccard_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', ws, 0.3, allow_empty=False,
                                l_out_attrs=['A.name'], r_out_attrs=['B.name'])


0%   100%
[#######] | ETA: 00:00:00
Total time elapsed: 00:00:00

In [17]:
output_pairs


Out[17]:
_id l_A.id r_B.id l_A.name r_B.name _sim_score
0 0 a3 b2 William Bridge Bill Bridge 0.333333
1 1 a2 b3 Michael Franklin Mike Franklin 0.333333
2 2 a5 b5 Alphonse Kemper Alfons Kemper 0.333333
3 3 a2 b6 Michael Franklin Michael Brodie 0.333333

As you can see, the tuple pair (a6, b7) is not present in the output.

Handling missing values

By default, pairs with missing values are not included in the output. This is because a string with a missing value can potentially match all strings in the other table, and hence the number of output pairs can become huge. If you want to include pairs with missing values in the output, you need to set the allow_missing flag to True, as shown below:


In [18]:
output_pairs = ssj.jaccard_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', ws, 0.3, allow_missing=True,
                                l_out_attrs=['A.name'], r_out_attrs=['B.name'])


0%   100%
[#####  ] | ETA: 00:00:00
0%  100%
[##] | ETA: 00:00:00
Finding pairs with missing value...
Total time elapsed: 00:00:00

In [19]:
output_pairs


Out[19]:
_id l_A.id r_B.id l_A.name r_B.name _sim_score
0 0 a3 b2 William Bridge Bill Bridge 0.333333
1 1 a2 b3 Michael Franklin Mike Franklin 0.333333
2 2 a5 b5 Alphonse Kemper Alfons Kemper 0.333333
3 3 a2 b6 Michael Franklin Michael Brodie 0.333333
4 4 a6 b7 1.000000
0 5 a7 b1 NaN Mark Levene NaN
1 6 a7 b2 NaN Bill Bridge NaN
2 7 a7 b3 NaN Mike Franklin NaN
3 8 a7 b4 NaN Joseph Kuan NaN
4 9 a7 b5 NaN Alfons Kemper NaN
5 10 a7 b6 NaN Michael Brodie NaN
6 11 a7 b7 NaN NaN
7 12 a7 b8 NaN NaN NaN
8 13 a1 b8 Kevin Smith NaN NaN
9 14 a2 b8 Michael Franklin NaN NaN
10 15 a3 b8 William Bridge NaN NaN
11 16 a4 b8 Binto George NaN NaN
12 17 a5 b8 Alphonse Kemper NaN NaN
13 18 a6 b8 NaN NaN

Enabling parallel processing

If you have multiple cores that you want to exploit for performing the join, you need to use the n_jobs option. If n_jobs is -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging and is the default option. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used, where n_cpus is the total number of CPUs in the machine. Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., this is equivalent to the default).

The following command exploits all the cores available to perform the join:


In [20]:
output_pairs = ssj.jaccard_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', ws, 0.3, 
                                l_out_attrs=['A.name'], r_out_attrs=['B.name'], n_jobs=-1)


0%  100%
[# ] | ETA: 00:00:00

In [21]:
len(output_pairs)


Out[21]:
5

Set n_jobs to 1 when you are debugging or do not want to use any parallel computing code. To execute the join as fast as possible, set n_jobs to -1, which uses all the CPUs in your machine. If other processes are running concurrently on your machine and you do not want them to be starved of CPU, you may want to set n_jobs to a value below -1, as shown below.
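
For example, the following sketch re-runs the same join while leaving one CPU free for other processes; the only change from the earlier command is the n_jobs value:


In [ ]:
# Use all CPUs but one: with n_jobs=-2, (n_cpus + 1 + n_jobs) = n_cpus - 1
# CPUs are used for the join.
output_pairs = ssj.jaccard_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', ws, 0.3, 
                                l_out_attrs=['A.name'], r_out_attrs=['B.name'], n_jobs=-2)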

Additional options

You can find all the options available for the Jaccard join function using the help command as shown below:


In [ ]:
help(ssj.jaccard_join)

More information

Similar to the Jaccard measure, you can use the package to perform joins using other measures such as cosine, Dice, edit distance, overlap, and overlap coefficient. For measures such as TF-IDF that are not directly supported, you can perform the join using the filters provided in the package. To learn more about the other join methods and how to use filters, refer to the how-to guide (available from the package homepage).
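
For instance, a cosine join over the same 'name' attributes can be performed in much the same way. The sketch below assumes that ssj.cosine_join accepts the same tokenizer, threshold, and output-attribute arguments as jaccard_join:


In [ ]:
# Sketch: a cosine join on the 'name' attribute with a threshold of 0.3,
# reusing the whitespace tokenizer and output attributes from above.
cosine_pairs = ssj.cosine_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', ws, 0.3,
                               l_out_attrs=['A.name'], r_out_attrs=['B.name'])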